ABSTRACT



Background: Hidden Markov models (HMMs) are powerful machine learning
tools successfully applied to problems of computational Molecular Biology.
In a predictive task, the HMM is endowed with a decoding algorithm in
order to assign the most probable state path, and in turn the class labeling,
to an unknown sequence. The Viterbi and the posterior decoding algorithms
are the most common. The former is very efficient when one path dominates,
while the latter, even though it does not guarantee to preserve the automaton
grammar, is more effective when several concurring paths have similar
probabilities. A third good alternative is 1-best, which was shown to perform
as well as or better than Viterbi.

Results: In this paper we introduce the posterior-Viterbi (PV), a new
decoding which combines the posterior and Viterbi algorithms. PV is a
two-step process: first the posterior probability of each state is computed,
and then the best allowed posterior path through the model is evaluated by a
Viterbi algorithm.

Conclusions: We show that PV decoding performs better than other algo- 
rithms first on toy models and then on the computational biological problem 
of the prediction of the topology of beta-barrel membrane proteins. 
Contacts: piero.fariselli@unibo.it 

Background 

Machine learning approaches have been shown to be very profitable in the 
field of computational Molecular Biology [Hj. Among them hidden Markov 
models (HMMs) have been proven to be especially successful when in the 
problem at hand regular grammar-like structures can be detected [3, 7].
HMMs were developed for alignments [Ulllj, pattern detection [ISl E] and 
also for predictions, as in the case of the topology of all-alpha and all-beta 
membrane proteins [HllIlllinillllliniliniiaE]- 

When HMMs are implemented for predicting a given feature, a decoding
algorithm is needed. By decoding we mean the assignment of a path
through the HMM states (the best one under a suitable measure) given
an observed sequence O. In this way, we can also assign a class label to each
sequence element, namely the label of the emitting state [3, 7]. More
generally, as stated in [13], decoding is the prediction of the labels of an
unknown path. Labeling is routinely the only relevant biological property
associated with the observed sequence; the states themselves may not
represent a significant piece of information, since they basically define the
automaton grammar.

The most famous decoding procedure is the Viterbi algorithm, which finds 
the most probable allowed path through the HMM model. Viterbi decoding 
is particularly effective when there is a single best path among others much 
less probable. When several paths have similar probabilities, the posterior
decoding or the 1-best algorithms are more convenient [13]. The posterior
decoding assigns the state path on the basis of the posterior probability,
although the selected path might not be allowed. For this reason, in order to
recast the automaton constraints, a post-processing algorithm was applied 
to the posterior decoding |H]. 

In this paper we address the problem of preserving the automaton gram- 
mar and concomitantly exploiting the posterior probabilities, without the 
need of the post-processing algorithm |HlEj. Prompted by this, we design 
a new decoding algorithm, the posterior-Viterbi decoding (PV), which
preserves the automaton grammar and at the same time exploits the posterior
probabilities. We show that PV performs better than the other algorithms
when we test it on toy models and on the problem of the prediction of the 
topology of beta-barrel membrane proteins. 

Methods 

The hidden Markov model definitions 

For the sake of clarity and compactness, in what follows we make use of
explicit BEGIN and END states and we do not treat the case of silent (null)
states. Their inclusion in the algorithms is only a technical matter and can
be done following the prescriptions indicated in jHl Ej . 

An observed sequence of length $L$ is indicated as $O\ (= O_1 \ldots O_L)$, both
for a single-symbol sequence (as in standard HMMs) and for a vector
sequence as described before [16]. $label(s)$ indicates the label associated
with the state $s$, while $\Lambda\ (= \Lambda_1, \ldots, \Lambda_L)$ is the list of
the labels associated with each sequence position $i$, obtained after the
application of a decoding algorithm. Depending on the problem at hand, the
labels may identify transmembrane regions, loops, secondary structures of
proteins, coding/non-coding regions, intergenic regions, etc. An HMM
consisting of $N$ states is therefore defined by three probability distributions.

Starting probabilities:

$$a_{BEGIN,k} = P(k \mid BEGIN) \tag{1}$$

Transition probabilities:

$$a_{s,k} = P(k \mid s) \tag{2}$$

Emission probabilities:

$$e_k(O_i) = P(O_i \mid k) \tag{3}$$

The forward probability is

$$f_k(i) = P(O_1, O_2 \ldots O_i, \pi_i = k) \tag{4}$$

which is the probability of having emitted the first partial sequence up to $i$,
ending at state $k$.

The backward probability is

$$b_k(i) = P(O_{i+1}, \ldots, O_{L-1}, O_L \mid \pi_i = k) \tag{5}$$

which is the probability of having emitted the sequence starting from the last
element back to the $(i+1)$th element, given that we end at position $i$ in state
$k$. The probability of emitting the whole sequence can be computed using
either forward or backward according to:

$$P(O \mid M) = f_{END}(L+1) = b_{BEGIN}(0) \tag{6}$$

Forward and backward are also necessary for updating the HMM parameters,
using the Baum-Welch algorithm [3, 7]. Alternatively, a gradient-based
training algorithm can be applied [3, 13].
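
As an illustration of Equations 4-6, the following minimal Python sketch
computes the forward and backward tables for a two-state occasionally
dishonest casino and checks that both give the same P(O|M). The emission
probabilities follow the caption of Figure 1; the starting, transition and
ending probabilities, as well as the names `forward` and `backward`, are
assumptions made only for this example and are not taken from the paper.

```python
# Toy LF-like model. Emissions follow the Figure 1 caption; all start,
# transition and end probabilities are hypothetical illustration values.
STATES = ["F", "L"]
A_BEGIN = {"F": 0.5, "L": 0.5}
A = {"F": {"F": 0.94, "L": 0.05}, "L": {"F": 0.10, "L": 0.89}}
A_END = {"F": 0.01, "L": 0.01}
E = {"F": {c: 1 / 6 for c in "123456"},
     "L": {c: (0.5 if c == "1" else 0.1) for c in "123456"}}


def forward(obs):
    """Return (f, p): f[i][k] is f_k(i+1) of Eq. 4, p is f_END(L+1)."""
    f = [{k: A_BEGIN[k] * E[k][obs[0]] for k in STATES}]
    for o in obs[1:]:
        prev = f[-1]
        f.append({k: E[k][o] * sum(prev[s] * A[s][k] for s in STATES)
                  for k in STATES})
    return f, sum(f[-1][s] * A_END[s] for s in STATES)


def backward(obs):
    """Return (b, p): b[i][k] is b_k(i+1) of Eq. 5, p is b_BEGIN(0)."""
    b = [dict(A_END)]  # b_k(L) = a_{k,END}
    for o in reversed(obs[1:]):
        nxt = b[0]
        b.insert(0, {k: sum(A[k][s] * E[s][o] * nxt[s] for s in STATES)
                     for k in STATES})
    return b, sum(A_BEGIN[s] * E[s][obs[0]] * b[0][s] for s in STATES)


obs = list("3151162611")
f, p_fwd = forward(obs)
b, p_bwd = backward(obs)
print(p_fwd, p_bwd)  # the two values agree up to rounding (Eq. 6)
```

For sequences as long as those used in practice the raw products underflow,
so a real implementation would normally rescale each column or work in log
space; this is omitted here for brevity.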

Viterbi decoding 

Viterbi decoding finds the path ($\pi$) through the model which has the maximal
probability with respect to all the others [3, 7]. This means that we look for
the path which is

$$\pi^v = \mathrm{argmax}_{\{\pi\}} P(\pi \mid O, M) \tag{7}$$

where $O\ (= O_1, \ldots, O_L)$ is the observed sequence of length $L$ and $M$ is
the trained HMM model. Since $P(O \mid M)$ is independent of a particular path
$\pi$, Equation 7 is equivalent to

$$\pi^v = \mathrm{argmax}_{\{\pi\}} P(\pi, O \mid M) \tag{8}$$

$P(\pi, O \mid M)$ can be easily computed as

$$P(\pi, O \mid M) = \prod_{i=1}^{L} a_{\pi(i-1),\pi(i)}\, e_{\pi(i)}(O_i) \cdot a_{\pi(L),END} \tag{9}$$

where by construction $\pi(0)$ is always the BEGIN state.

Defining $v_k(i)$ as the probability of the most likely path ending in state $k$
at position $i$, and $p_i(k)$ as the trace-back pointer, $\pi^v$ can be obtained by
running the following dynamic programming procedure, called Viterbi decoding:

• Initialization

$$v_{BEGIN}(0) = 1 \qquad v_k(0) = 0 \ \text{for}\ k \neq BEGIN$$

• Recursion

$$v_k(i) = [\max_{\{s\}}(v_s(i-1)\, a_{s,k})]\, e_k(O_i)$$
$$p_i(k) = \mathrm{argmax}_{\{s\}}\, v_s(i-1)\, a_{s,k}$$

• Termination

$$P(O, \pi^v \mid M) = \max_{\{s\}}[v_s(L)\, a_{s,END}]$$
$$\pi^v_L = \mathrm{argmax}_{\{s\}}[v_s(L)\, a_{s,END}]$$

• Traceback

$$\pi^v_{i-1} = p_i(\pi^v_i) \quad \text{for}\ i = L \ldots 1$$

• Label assignment

$$\Lambda_i = label(\pi^v_i) \quad \text{for}\ i = 1 \ldots L$$
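
The steps listed above can be sketched in a few lines of Python. This is a
minimal illustration and not the authors' implementation: the two-state
model uses the emission probabilities of the Figure 1 caption, while the
starting, transition and ending probabilities and the function name
`viterbi` are hypothetical choices made for the example.

```python
# Toy LF-like model; start, transition and end probabilities are hypothetical.
STATES = ["F", "L"]
A_BEGIN = {"F": 0.5, "L": 0.5}
A = {"F": {"F": 0.94, "L": 0.05}, "L": {"F": 0.10, "L": 0.89}}
A_END = {"F": 0.01, "L": 0.01}
E = {"F": {c: 1 / 6 for c in "123456"},
     "L": {c: (0.5 if c == "1" else 0.1) for c in "123456"}}


def viterbi(obs):
    """Return the most probable state path for the observed symbols."""
    # Initialization merged with the first recursion step: v_k(1).
    v = {k: A_BEGIN[k] * E[k][obs[0]] for k in STATES}
    pointers = []
    for o in obs[1:]:
        # Recursion: best predecessor of each state, then the new v_k(i).
        back = {k: max(STATES, key=lambda s: v[s] * A[s][k]) for k in STATES}
        v = {k: v[back[k]] * A[back[k]][k] * E[k][o] for k in STATES}
        pointers.append(back)
    # Termination: best last state, including the transition to END.
    state = max(STATES, key=lambda s: v[s] * A_END[s])
    path = [state]
    # Traceback from position L down to position 1.
    for back in reversed(pointers):
        state = back[state]
        path.append(state)
    path.reverse()
    return path


print("".join(viterbi(list("1111111111"))))  # LLLLLLLLLL
print("".join(viterbi(list("2345623456"))))  # FFFFFFFFFF
```

Since each state of this toy model carries its own label, the state path is
also the labeling; with several states per label a final `label` lookup would
be applied to every element of the path.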

1-best decoding 

The 1-best labeling algorithm described here is Krogh's previously
described variant of the N-best decoding [13]. Since there is no exact
algorithm for finding the most probable labeling, 1-best is an approximate
algorithm which usually achieves good results in solving this task [13].
Unlike Viterbi, the 1-best algorithm ends when the most probable labeling is
computed, so that no trace-back is needed.

For the sake of clarity, here we present a redundant description, in which we
define $H_i$ as the set of all labeling hypotheses surviving as 1-best for each
state $s$ up to sequence position $i$. In the worst case the number of distinct
labeling hypotheses is equal to the number of states. $h_i^s$ is the current
partial labeling hypothesis associated with the state $s$ from the beginning to
the $i$-th sequence position. In general several states may share the same
labeling hypothesis. Finally, we use $\oplus$ as the string concatenation
operator, so that 'AAAA' $\oplus$ 'B' = 'AAAAB'. The 1-best algorithm can then be
described as

• Initialization

$$v_{BEGIN}(0) = 1 \qquad v_k(0) = 0 \ \text{for}\ k \neq BEGIN$$
$$v_k(1) = a_{BEGIN,k} \cdot e_k(O_1) \qquad H_1 = \{label(k) : a_{BEGIN,k} \neq 0\}$$
$$H_i = \emptyset \quad \text{for}\ i = 2, \ldots, L$$

• Recursion

$$v_k(i+1) = \max_{h \in H_i}\Big[\sum_s v_s(i) \cdot \delta(h_i^s, h) \cdot a_{s,k}\Big]\, e_k(O_{i+1})$$
$$h_{i+1}^k = \mathrm{argmax}_{h \in H_i}\Big[\sum_s v_s(i) \cdot \delta(h_i^s, h) \cdot a_{s,k}\Big] \oplus label(k)$$
$$H_{i+1} \leftarrow H_{i+1} \cup \{h_{i+1}^k\}$$

• Termination

$$\Lambda = \mathrm{argmax}_{h \in H_L} \sum_s v_s(L)\, \delta(h_L^s, h)\, a_{s,END}$$

where we use the Kronecker delta $\delta(a, b)$ (which is 1 when $a = b$, 0
otherwise). With 1-best decoding we do not need to keep a backtrace matrix,
since $\Lambda$ is computed during the forward steps.
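
A compact way to read the 1-best recursion is as a forward pass in which
every state carries one labeling hypothesis and probabilities are summed
over the states that share it. The Python sketch below follows that reading.
It is only an illustration under stated assumptions: the three-state model
(loaded tosses forced to come in pairs through the states `L1` and `L2`),
all its start, transition and end probabilities, and the name `one_best` are
hypothetical, and labels are single characters so that string concatenation
plays the role of the concatenation operator.

```python
# Hypothetical model: fair state F, loaded states L1 -> L2 always in pairs.
STATES = ["F", "L1", "L2"]
LABEL = {"F": "F", "L1": "L", "L2": "L"}
A_BEGIN = {"F": 0.5, "L1": 0.5, "L2": 0.0}
A = {"F": {"F": 0.94, "L1": 0.05, "L2": 0.0},
     "L1": {"F": 0.0, "L1": 0.0, "L2": 1.0},
     "L2": {"F": 0.10, "L1": 0.89, "L2": 0.0}}
A_END = {"F": 0.01, "L1": 0.0, "L2": 0.01}
FAIR = {c: 1 / 6 for c in "123456"}
LOADED = {c: (0.5 if c == "1" else 0.1) for c in "123456"}
E = {"F": FAIR, "L1": LOADED, "L2": LOADED}


def one_best(obs):
    """Return the labeling kept by the 1-best heuristic for `obs`."""
    v = {k: A_BEGIN[k] * E[k][obs[0]] for k in STATES}   # v_k(1)
    hyp = {k: LABEL[k] for k in STATES}                  # hypothesis per state
    for o in obs[1:]:
        new_v, new_hyp = {}, {}
        for k in STATES:
            # Sum incoming probability over states sharing a hypothesis.
            score = {}
            for s in STATES:
                score[hyp[s]] = score.get(hyp[s], 0.0) + v[s] * A[s][k]
            best = max(score, key=score.get)
            new_v[k] = score[best] * E[k][o]
            new_hyp[k] = best + LABEL[k]                 # extend by label(k)
        v, hyp = new_v, new_hyp
    # Termination: sum over states sharing a hypothesis, weighted by a_{s,END}.
    final = {}
    for s in STATES:
        final[hyp[s]] = final.get(hyp[s], 0.0) + v[s] * A_END[s]
    return max(final, key=final.get)


print(one_best(list("2345623456")))  # FFFFFFFFFF
```

Storing whole label strings per state is wasteful but keeps the example
short; sharing common prefixes between hypotheses would reduce the memory
use.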

Posterior decoding 

The posterior decoding finds the path which maximizes the product of the 
posterior probability of the states j21IZ|. Using the usual notation for forward 
($f_k(i)$) and backward ($b_k(i)$) we have

$$P(\pi_i = k \mid O, M) = f_k(i)\, b_k(i) / P(O \mid M) \tag{10}$$

The path which maximizes the posterior probability is then computed as

$$\pi^p_i = \mathrm{argmax}_{\{s\}} P(\pi_i = s \mid O, M) \quad \text{for}\ i = 1 \ldots L \tag{11}$$

The corresponding label assignment is

$$\Lambda_i = label(\pi^p_i) \quad \text{for}\ i = 1 \ldots L \tag{12}$$

If we have more than one state sharing the same label, labeling can be
improved by summing over the states that share the same label (posterior sum).
In this way we can have a path through the model which maximizes the
posterior probability of being in a state with label $\lambda$ when emitting the
observed sequence element, or more formally:

$$\Lambda_i = \mathrm{argmax}_{\{\lambda\}} \sum_{label(s)=\lambda} P(\pi_i = s \mid O, M) \quad \text{for}\ i = 1 \ldots L \tag{13}$$

The drawback of posterior decoding is that the state path sequences $\pi^p$ or
$\Lambda$ may not be allowed paths. However, this decoding can perform better
than Viterbi when more than one highly probable path exists [3, 7]. In this
case, a post-processing algorithm that recasts the original topological
constraints is
recommended |Hj. 

In the sequel, if not differently indicated, with the term posterior we mean 
the posterior sum. 
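
The posterior sum of Equation 13 can be illustrated with a short Python
sketch. It assumes a hypothetical three-state model in which two states
(`L1`, `L2`) share the label L; the start, transition and end probabilities
and the names `posteriors` and `posterior_sum_labels` are inventions for this
example, while the emissions follow the Figure 1 caption.

```python
# Hypothetical model: fair state F, loaded states L1 -> L2 always in pairs.
STATES = ["F", "L1", "L2"]
LABEL = {"F": "F", "L1": "L", "L2": "L"}
A_BEGIN = {"F": 0.5, "L1": 0.5, "L2": 0.0}
A = {"F": {"F": 0.94, "L1": 0.05, "L2": 0.0},
     "L1": {"F": 0.0, "L1": 0.0, "L2": 1.0},
     "L2": {"F": 0.10, "L1": 0.89, "L2": 0.0}}
A_END = {"F": 0.01, "L1": 0.0, "L2": 0.01}
FAIR = {c: 1 / 6 for c in "123456"}
LOADED = {c: (0.5 if c == "1" else 0.1) for c in "123456"}
E = {"F": FAIR, "L1": LOADED, "L2": LOADED}


def posteriors(obs):
    """Return P(pi_i = k | O, M) for every position (Eq. 10)."""
    f = [{k: A_BEGIN[k] * E[k][obs[0]] for k in STATES}]
    for o in obs[1:]:
        prev = f[-1]
        f.append({k: E[k][o] * sum(prev[s] * A[s][k] for s in STATES)
                  for k in STATES})
    b = [dict(A_END)]
    for o in reversed(obs[1:]):
        nxt = b[0]
        b.insert(0, {k: sum(A[k][s] * E[s][o] * nxt[s] for s in STATES)
                     for k in STATES})
    p = sum(f[-1][s] * A_END[s] for s in STATES)
    return [{k: fi[k] * bi[k] / p for k in STATES} for fi, bi in zip(f, b)]


def posterior_sum_labels(obs):
    """Pick, at each position, the label with the largest summed posterior."""
    out = []
    for post in posteriors(obs):
        score = {}
        for s in STATES:
            score[LABEL[s]] = score.get(LABEL[s], 0.0) + post[s]
        out.append(max(score, key=score.get))
    return "".join(out)


print(posterior_sum_labels(list("1111111111")))  # LLLLLLLLLL
```

Each position is decided independently, so nothing in this procedure forces
the resulting label string to respect the pairing imposed by the model; this
is the drawback discussed above.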

Posterior-Viterbi decoding

Posterior-Viterbi decoding is based on the combination of the Viterbi and
posterior algorithms. After having computed the posterior probabilities, we
use a Viterbi algorithm to find the best allowed posterior path through the
model. A related idea, specific to pairwise alignments, was previously
introduced to improve the sequence alignment accuracy [9].

In the PV algorithm, the basic idea is to compute the path $\pi^{PV}$

$$\pi^{PV} = \mathrm{argmax}_{\{\pi \in A_p\}} \prod_{i=1}^{L} P(\pi_i \mid O, M) \tag{14}$$

where $A_p$ is the set of the allowed paths through the model, and
$P(\pi_i \mid O, M)$ is the posterior probability of the state assigned by the
path $\pi$ at position $i$ (as computed in Eq. 10).

Defining a function $\delta^*(s, t)$ that is 1 if $s \rightarrow t$ is an allowed
transition of the model $M$ and 0 otherwise, $v_k(i)$ as the probability of the
most probable allowed posterior path ending at state $k$ having observed the
partial $O_1, \ldots, O_i$, and $p_i$ as the trace-back pointer, we can compute the
best path $\pi^{PV}$ using the Viterbi algorithm

• Initialization

$$v_{BEGIN}(0) = 1 \qquad v_k(0) = 0 \ \text{for}\ k \neq BEGIN$$

• Recursion

$$v_k(i) = \max_{\{s\}}[v_s(i-1)\, \delta^*(s, k)]\, P(\pi_i = k \mid O, M)$$
$$p_i(k) = \mathrm{argmax}_{\{s\}}[v_s(i-1)\, \delta^*(s, k)]$$

• Termination

$$P(\pi^{PV} \mid M, O) = \max_{\{s\}}[v_s(L)\, \delta^*(s, END)]$$
$$\pi^{PV}_L = \mathrm{argmax}_{\{s\}}[v_s(L)\, \delta^*(s, END)]$$

• Traceback

$$\pi^{PV}_{i-1} = p_i(\pi^{PV}_i) \quad \text{for}\ i = L \ldots 1$$

• Label assignment

$$\Lambda_i = label(\pi^{PV}_i) \quad \text{for}\ i = 1 \ldots L$$
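
The two steps of PV decoding (posterior computation, then a Viterbi pass
restricted to allowed transitions) can be sketched as follows. This is a
minimal illustration rather than the authors' code: the three-state model
with paired loaded tosses, its start, transition and end probabilities, and
the names `posteriors`, `allowed` and `posterior_viterbi` are all
hypothetical; a transition is treated as allowed when its probability in the
model is non-zero.

```python
# Hypothetical model: fair state F, loaded states L1 -> L2 always in pairs.
STATES = ["F", "L1", "L2"]
LABEL = {"F": "F", "L1": "L", "L2": "L"}
A_BEGIN = {"F": 0.5, "L1": 0.5, "L2": 0.0}
A = {"F": {"F": 0.94, "L1": 0.05, "L2": 0.0},
     "L1": {"F": 0.0, "L1": 0.0, "L2": 1.0},
     "L2": {"F": 0.10, "L1": 0.89, "L2": 0.0}}
A_END = {"F": 0.01, "L1": 0.0, "L2": 0.01}
FAIR = {c: 1 / 6 for c in "123456"}
LOADED = {c: (0.5 if c == "1" else 0.1) for c in "123456"}
E = {"F": FAIR, "L1": LOADED, "L2": LOADED}


def posteriors(obs):
    """Step 1: P(pi_i = k | O, M) for every position (forward-backward)."""
    f = [{k: A_BEGIN[k] * E[k][obs[0]] for k in STATES}]
    for o in obs[1:]:
        prev = f[-1]
        f.append({k: E[k][o] * sum(prev[s] * A[s][k] for s in STATES)
                  for k in STATES})
    b = [dict(A_END)]
    for o in reversed(obs[1:]):
        nxt = b[0]
        b.insert(0, {k: sum(A[k][s] * E[s][o] * nxt[s] for s in STATES)
                     for k in STATES})
    p = sum(f[-1][s] * A_END[s] for s in STATES)
    return [{k: fi[k] * bi[k] / p for k in STATES} for fi, bi in zip(f, b)]


def allowed(prob):
    """delta*: 1 for a transition the model permits, 0 otherwise."""
    return 1.0 if prob > 0 else 0.0


def posterior_viterbi(obs):
    """Step 2: Viterbi over posteriors, using only allowed transitions."""
    post = posteriors(obs)
    v = {k: allowed(A_BEGIN[k]) * post[0][k] for k in STATES}
    pointers = []
    for p_i in post[1:]:
        back = {k: max(STATES, key=lambda s: v[s] * allowed(A[s][k]))
                for k in STATES}
        v = {k: v[back[k]] * allowed(A[back[k]][k]) * p_i[k] for k in STATES}
        pointers.append(back)
    state = max(STATES, key=lambda s: v[s] * allowed(A_END[s]))
    path = [state]
    for back in reversed(pointers):
        state = back[state]
        path.append(state)
    path.reverse()
    return path


path = posterior_viterbi(list("6111311112"))
print("".join(LABEL[s] for s in path))
```

Unlike the position-by-position posterior decoding, the path returned here
only uses transitions that exist in the model, so every loaded run in the
printed labeling has even length.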

Datasets 

Two different types of data are used to score the posterior-Viterbi algorithm,
namely synthetic and real data. In the former case, we start with the simple
occasionally dishonest casino illustrated in [7], referred to here as the LF
model (Figure 1); then we increase the complexity of the automaton with two
other models. First, we introduce the occasionally dishonest casino reported
in Figure 2 and referred to as L2F2, in which fair (label F) and loaded dice
(label L) always come in pairs (or one die is always tossed twice). A third,
more complex version of the occasionally dishonest casino is shown in Figure
3 (model L3F3). In L3F3 the loaded tosses come in multiples of three (or one
die is always tossed three times), while the number of fair tosses is at least
three but can be more.

Accordingly, for each toy model presented above (Figures 1, 2 and 3), we
produced 50 sequences of 300 dice outcomes and we trained the corresponding
empty models (one for each model) using the Baum-Welch algorithm. The
initial empty models have the same topology as the models LF, L2F2 and
L3F3, with their emission and allowed transition probabilities set to the
uniform distribution.
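
As a rough illustration of how such synthetic data can be produced, the
sketch below samples 50 sequences of 300 outcomes, together with their true
labels, from a two-state LF-like casino. The emission probabilities follow
the Figure 1 caption; the self-transition probabilities, the random seed and
the function name `sample_casino` are assumptions made for this example, and
the transition values of the models actually used in the paper are not
reproduced here.

```python
import random


def sample_casino(length, rng):
    """Sample (rolls, labels) from a hypothetical two-state LF model."""
    faces = "123456"
    weights = {"F": [1 / 6] * 6, "L": [0.5] + [0.1] * 5}
    stay = {"F": 0.95, "L": 0.90}  # hypothetical self-transition probabilities
    state = rng.choice("FL")
    rolls, labels = [], []
    for _ in range(length):
        # Emit one die face from the current state, then record its label.
        rolls.append(rng.choices(faces, weights=weights[state])[0])
        labels.append(state)
        # Switch die with probability 1 - stay[state].
        if rng.random() > stay[state]:
            state = "L" if state == "F" else "F"
    return "".join(rolls), "".join(labels)


rng = random.Random(0)
data = [sample_casino(300, rng) for _ in range(50)]
print(data[0][0][:30])
print(data[0][1][:30])
```

The label strings are what the decoding algorithms are later asked to
recover from the rolls alone.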

After training, we tested the ability of different algorithms (Viterbi, 1- 
best and PV) to recover the original labeling from the observed sequence of 
numbers (the dice outcomes). 

The problem of the prediction of the all-beta transmembrane regions is
used to test the algorithm on a real data application. In this case we use a
set that includes 20 constitutive beta-barrel membrane proteins whose
sequences are less than 25% homologous and whose 3D structures have been
resolved. The number of beta-strands forming the transmembrane barrel
ranges from 2 to 22. Among the 20 proteins, 15 were used to train a
circular HMM (described in [16]), and here are tested in cross-validation
(1a0sP, 1bxwA, 1e54, 1ek9A, 1fcpA, 1fep, 1i78A, 1k24, 1kmoA, 1prn, 1qd5A,
1qj8A, 2mprA, 2omf, 2por). Since there is no detectable sequence identity
among the selected 15 proteins, we adopted a leave-one-out approach for
training the HMM and testing it. All the reported results are obtained
during the testing phase, and the complete set of results is available at
www.biocomp.unibo.it/piero/posvit.

The other 5 new proteins (1mm4, 1nqf, 1p4t, 1uyn, 1t16) are used as a
blind new test.

Measures of accuracy

We used three indices to score the accuracy of the algorithms. The first one
is $Q_2$, which computes the number of correctly assigned labels divided by the
total number of observed symbols. Then we use the SOV index [20] to
evaluate the segment overlaps. Finally, in the case of the all-beta
transmembrane proteins we adopt a very stringent measure called $Q_{ok}$: a
prediction is considered correct only if the number of transmembrane segments
coincides with the observed one and the corresponding segments have a minimal
overlap of
m residues [Hj. The value m is segment-dependent and for each segment 
pairs, is computed as 

$$m = \min\{|seg_{pr}|/2,\ |seg_{ob}|/2\} \tag{15}$$

where $|seg_{pr}|$ and $|seg_{ob}|$ are the predicted and observed segment lengths,
respectively.
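
To make the two simpler indices concrete, the sketch below computes $Q_2$
and the per-protein $Q_{ok}$ criterion of Equation 15 on label strings. The
label alphabet ('T' for transmembrane, '-' for everything else), the
function names and the pairing of predicted and observed segments by their
order along the sequence are assumptions made for this example.

```python
def q2(pred, obs):
    """Fraction of positions whose predicted label equals the observed one."""
    return sum(p == o for p, o in zip(pred, obs)) / len(obs)


def segments(labels, target):
    """Return half-open (start, end) intervals of consecutive target labels."""
    segs, start = [], None
    for i, lab in enumerate(labels + "\0"):   # sentinel closes a final run
        if lab == target and start is None:
            start = i
        elif lab != target and start is not None:
            segs.append((start, i))
            start = None
    return segs


def q_ok(pred, obs, target="T"):
    """True if segment counts match and each pair overlaps by at least m."""
    pr, ob = segments(pred, target), segments(obs, target)
    if len(pr) != len(ob):
        return False
    for (ps, pe), (os_, oe) in zip(pr, ob):
        overlap = max(0, min(pe, oe) - max(ps, os_))
        m = min((pe - ps) / 2, (oe - os_) / 2)   # Eq. 15
        if overlap < m:
            return False
    return True


observed = "--TTTT--TTTT--"
predicted = "---TTTT-TTT---"
print(q2(predicted, observed))                 # 11 of 14 positions agree
print(q_ok(predicted, observed))               # True
print(q_ok("--TTTTTTTTTT--", observed))        # False: one segment, not two
```

The average of `q_ok` over a set of proteins gives the fraction of correctly
predicted topologies.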

Results and Discussion 

Testing the decoding algorithms on toy models 

We start with one of the simplest HMM models that can be thought of (LF),
which is the occasionally dishonest casino presented in [7]. LF can parse any
kind of observed sequence of numbers ranging from 1 to 6 (the die faces),
generated with loaded and fair dice. Based on the LF model we produced
50 sequences with 300 dice outcomes and we trained an empty model with
them. After this, we tested the three decoding algorithms that preserve the
automaton grammar on the task of reconstructing the correct labeling.

In Table 1 we show that the accuracy of the posterior-Viterbi is greater
than those of the other two algorithms. It is worth noticing that with this
simple model the posterior algorithm alone achieves a similar accuracy (data
not shown).

The L2F2 and L3F3 models, in which none of the posterior decoding
reconstructions is consistent with the automaton grammar (not parsable), are
of some interest. In this case, among the three grammar-preserving algorithms,
the posterior-Viterbi is the best performing one. This is particularly true
for the L3F3 model, in which the SOV values highlight a quite good
performance of PV. Considering that the three reconstructed models, as computed
with the Baum-Welch, are very similar to the theoretical ones and independent
of decoding, it is worth noticing the performance drop of the Viterbi
algorithm. From these results it appears that in some cases the use of the PV
decoding leads to a better performance given the same data and the same
model.

Testing the decoding algorithms on real data 

In order to test our decoding algorithm on real biological data, we used
a previously developed HMM, devised for the prediction of the topology of
beta-barrel membrane proteins [16]. The hidden Markov model is a
sequence-profile-based HMM and takes advantage of emitting vectors instead of
symbols, as described in [16].

Since the previously designed and trained HMM [16] emits profile vectors,
sequence profiles have been computed from the alignments as derived
with PSI-BLAST [1] on the non-redundant database of protein sequences
(ftp://ftp.ncbi.nlm.nih.gov/blast/db/).

The results obtained using the four different decoding algorithms are
shown in Table 2, where the performance is tested with a jack-knife
validation procedure for the first 15 proteins and as a blind test for the
latter 5 (see Methods). It is evident that for the problem at hand the Viterbi
decoding and the 1-best are unreliable, since only one of the proteins is
correctly assigned. In this case the posterior decoding is more effective and
can correctly assign 60% and 40% of the proteins, in cross-validation and on
the blind set, respectively. Here the posterior decoding is used without
MaxSubSeq [8], introduced before to recast the grammar [16].

From Table 2 it is evident that the new PV decoding is the best performing
decoding, achieving 80% and 60% accuracy in cross-validation and on the
blind set, respectively. This is done ensuring that predictions are consistent
with the designed automaton grammar.

Comparison with other available HMMs

Although this is out of the scope of this paper, the reader may be interested in
seeing a comparison between our HMM decoding and the results obtained from
the available web servers based on similar approaches [2, 6]. The results are
shown in Table 3. The tmbb server [2] allows the user to test three
different algorithms, namely Viterbi, 1-best and posterior. Unlike us, its
authors find that their HMM does not show significant differences among
the three decoding algorithms. This dissimilar behaviour may be due to
several concurring facts: i) the different HMM models, ii) tmbb runs on a
single-sequence input, iii) tmbb is trained using the Conditional Maximum
Likelihood [12].

The second server, PROFtmb [6], is based on a method that exploits
multiple sequence information and posterior probabilities. Its decoding is
related to the posterior-Viterbi; however, in their algorithm the authors first
obtain the posterior sum contracted into two possible labelings (inner/outer
loops and transmembrane, as we did in [16]); then they make use of the
explicit values of the HMM transition probabilities ($a_{i,j}$). In this way
they count the transition probabilities twice (implicitly in the posterior
probability and directly in their algorithm), and the PROFtmb performance is
not very different from ours. In our opinion, the fact that the newly
implemented PV algorithm performs similarly or better with respect to all
indices suggests that PV can be useful also when applied to other HMM models.

Conclusions 

The new PV decoding algorithm is more convenient in that it overcomes the
difficulties of introducing a problem-dependent optimization algorithm when
the automaton grammar is to be re-cast. When one state path dominates
we may expect that PV does not perform better than the other decoding
algorithms, and in these cases the 1-best is preferred [13]. Nevertheless, we
show that when several concurring paths are present, as in the case of our
beta-barrel HMM, PV performs better than the others.

A performance similar to that obtained with PV decoding can be achieved
using the MaxSubSeq algorithm [8] on top of the posterior sum decoding.
However, although MaxSubSeq is a very general two-class segment optimization
algorithm, PV is far more useful when the underlying predictor is an HMM,
where more than two labels and different constraints can be introduced into
the automaton grammars.

Although PV takes longer than the other algorithms (the posterior plus
the Viterbi time), the PV asymptotic computational time complexity
remains, as for the other decodings, $O(N^2 \cdot L)$ (where $L$ and $N$ are the
protein length and the number of states, respectively). As far as the
memory requirement is concerned, PV needs the same space complexity as
Viterbi and posterior ($O(N \cdot L)$), while 1-best in the average case requires
less memory, and this can be further reduced [13]. When computational speed
is an issue, the Viterbi algorithm is the fastest, and the time ordering is
$time(Viterbi) < time(1\text{-}best) < time(PV)$.

Finally, PV satisfies any HMM grammar structure, including automata
containing silent states, and it is applicable to all possible HMM models
with an arbitrary number of labels, without having to work out a
problem-dependent optimization algorithm.

List of abbreviations 

• HMM: hidden Markov model. 

• PV: Posterior-Viterbi.

Authors' contributions 

PF developed the Posterior-Viterbi algorithm. PLM designed and trained
the Hidden Markov Models. RC contributed to the problem. PF, PLM and 
RC authored the manuscript. 






Acknowledgements 



This work was partially supported by the BioSapiens Network of Excellence, 
a grant of the Ministero della Universita e della Ricerca Scientifica e Tec- 
nologica (MURST), a grant for a target project in Biotechnology (CNR), a 
project on Molecular Genetics (CNR), a PRIN 2002 and a PNR 2001-2003 
(FIRB art. 8). 

References 

[1] Altschul,S.F., Madden, T.L., Schaffer,A.A., Zhang,J., Zhang,Z., 
Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: 
A new generation of protein database search programs. Nucleic Acid 
Res., 25, 3389-3402. 

[2] Bagos,P.G., Liakopoulos,T.D., Spyropoulos,I.C., Hamodrakas,S.J.
(2004) PRED-TMBB: a web server for predicting the topology of beta-
barrel outer membrane proteins. Nucleic Acids Res., 32, W400-W404.

[3] Baldi,P. and Brunak,S. (2001) Bioinformatics: the Machine Learning 
Approach MIT Press. 

[4] Baldi,P., Chauvin,Y., Hunkapiller,T., and McClure,M.A. (1994) Hid- 
den Markov Models of Biological Primary Sequence Information, PNAS 
USA, 91, 1059-1063. 

[5] Bateman,A., Birney,E., Cerruti,L., Durbin,R., Etwiller,L., Eddy,S.R.,
Griffiths-Jones,S., Howe,K.L., Marshall,M. and Sonnhammer,E.L.
(2002) The Pfam Protein Families Database, Nucleic Acids Research,
30, 276-280.

[6] Bigelow,H.R., Petrey,D.S., Liu,J., Przybylski,D., and Rost,B. (2004) 
Predicting transmembrane beta-barrels in proteomes. Nucleic Acids 
Res., 32, 2566-2577. 






[7] Durbin,R., Eddy,S., Krogh,A. and Mitchinson,G. (1998) Biological se- 
quence analysis: probabilistic models of proteins and nucleic acids. 
Cambridge Univ. Press, Cambridge. 

[8] Fariselli,P., Finelli,M., Marchignoli,D., Martelli,P.L., Rossi,I. and
Casadio,R. (2003) MaxSubSeq: an algorithm for segment-length
optimization. The case study of the transmembrane spanning segments,
Bioinformatics, 19, 500-505.

[9] Holmes,I., and Durbin,R. (1998) Dynamic programming alignment
accuracy, J Comput Biol., 493-504.

[10] Liu,Q., Zhu,Y.S., Wang,B.H., and Li,Y.X (2003) A HMM-based 
method to predict the transmembrane regions of beta-barrel membrane 
proteins. Comput Biol Chem. 27,69-76. 

[11] Krogh,A., Brown,M., Mian,I.S., Sjolander,K., and Haussler,D. (1994)
Hidden Markov models in computational biology: Applications to
protein modeling. Journal of Molecular Biology, 235, 1501-1531.

[12] Krogh,A. (1994) Hidden Markov models for labeled sequences. In
Proceedings 12th International Conference on Pattern Recognition. IEEE
Comp. Soc. Press, Singapore, pp. 140-144.

[13] Krogh,A. (1997) Two methods for improving performance of a HMM
and their application for gene finding. Proceedings of the Fifth
International Conference on Intelligent Systems for Molecular Biology, pages
179-186, Menlo Park, CA, AAAI Press.

[14] Krogh,A., Larsson,B., von Heijne,G. and Sonnhammer,E.L. (2001)
Predicting transmembrane protein topology with a hidden Markov model:
application to complete genomes, J. Mol. Biol., 305, 567-580.

[15] Mamitsuka,H. (1998) Predicting peptides that bind to MHC molecules 
using supervised learning of hidden Markov models. Proteins 33,460- 
474. 






[16] Martelli,P.L., Fariselli,P., Krogh,A., Casadio,R. (2002) A sequence- 
profile-based HMM for predicting and discriminating beta barrel mem- 
brane proteins, Bioinformatics 18, S46-S53. 



[17] Martelli,P.L., Fariselli,P., and Casadio,R. (2003) An ENSEMBLE
machine learning approach for the prediction of all-alpha membrane
proteins, Bioinformatics, 19, 1205-1211.

[18] Tusnady,G.E. and Simon,I. (1998) Principles governing amino acid
composition of integral membrane proteins: application to topology
prediction, J. Mol. Biol., 283, 489-506.

[19] Viklund,H., and Elofsson,A. (2004) Best alpha-helical transmembrane 
protein topology predictions are achieved using hidden Markov models 
and evolutionary information. Protein Sci., 13,1908-1917. 

[20] Zemla,A., Venclovas,C., Fidelis,K., Rost,B. (1999) A modified
definition of Sov, a segment-based measure for protein secondary structure
prediction assessment. Proteins, 34, 220-223.






Table 1: Accuracy of the different algorithms on the toy models

Algorithm            Index     LF            L2F2          L3F3

Viterbi              Q2        0.80          0.86          0.47
                     SOV       [illegible]   [illegible]   [illegible]
                     SOV(L)    0.42          0.64          0.37

1-best               Q2        0.80          0.86          0.88
                     SOV       0.48          0.73          0.81
                     SOV(L)    0.42          0.64          0.72

posterior-Viterbi    Q2        0.82          0.88          0.90
                     SOV       0.66          0.80          0.82
                     SOV(L)    0.61          0.75          0.78

For the indices see the 'Measures of accuracy' section. SOV(L) = SOV computed
for the loaded class only.

Table 2: Qok prediction accuracy obtained with the four different decoding
algorithms on the real data

Protein       Viterbi   1-best   posterior   posterior-Viterbi

cross-validation
1a0spTOT      -         -        -           OK
1bxwaTOT      -         -        OK          OK
1e54          -         -        OK          OK
1ek9aTOT      -         -        OK          OK
1fcpaTOT      -         -        -           -
1fepTOT       -         -        -           OK
1i78a         -         -        OK          OK
1k24          -         -        -           OK
1kmoaTOT      -         -        OK          OK
1prn          -         -        -           -
1qd5a         -         -        OK          OK
1qj8a         -         -        OK          OK
2mpra         -         -        OK          OK
2omf          -         -        OK          OK
2por          -         -        -           -
<Qok>         0.0       0.0      0.60        0.80

blind-test
1mm4          -         -        OK          -
1nqf          -         -        -           OK
1p4t          OK        OK       OK          OK
1uyn          -         -        -           OK
1t16          -         -        -           -
<Qok>         0.20      0.20     0.40        0.60

Qok: see Measures of accuracy.

Table 3: Posterior-Viterbi accuracy compared with other algorithms and
HMM models

Method                   Q2            SOV    SOV(BetaTM)   SOV(Loop)   Qok

cross-validation
Posterior-Viterbi (1)    [illegible]   0.87   0.92          0.81        0.80
Viterbi (1)              0.63          0.33   0.27          0.35        0.0
1-best (1)               0.65          0.37   0.31          0.38        0.0
PROFtmb (2)

(1) Model taken from Martelli et al., 2002 [16]
(2) Bigelow et al., 2004 [6]
(3) Bagos et al., 2004 [2]
(4) This refers only to posterior-Viterbi decoding

Figure 1: Occasionally dishonest casino (Model LF). The emission
probabilities of the fair state (F) are 1/6 for each possible outcome, while in
the loaded die the emission probabilities are 1/2 for the '1' and 1/10 for the
other faces.

Figure 2: Occasionally dishonest casino (Model L2F2). For the emission
probabilities see Figure 1.

Figure 3: Occasionally dishonest casino (Model L3F3). For the emission
probabilities see Figure 1.